The objective of the course is to become familiar with R programming; it was recommended to me by one of my colleagues. I learned how to install and configure the software needed for statistical programming. I faced some difficulties syncing files with GitHub (sometimes it works and sometimes it gives me error messages); I hope the situation will be easier going forward. I hope the course will cover practical issues in statistical computing: R programming, reading data, and writing functions. The R for Health Data Science book is very simple and to the point. It takes some time to work through Exercise Set 1 alongside the book, but I liked the idea of having all the code in R chunks indexed in the file. Some points in the “summarizing the data” chapter were difficult and took time to understand.
My GitHub repository: https://github.com/elmoazen/IODS-project .
My course diary: https://elmoazen.github.io/IODS-project .
# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Tue Dec 13 00:48:45 2022"
1+1
## [1] 2
The text continues here.
Describe the work you have done this week and summarize your learning.
# Access tidyverse package
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
# Access GGally library
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
# Read the CSV file
learning2014<-read_csv("learning2014.csv")
## Rows: 166 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): gender
## dbl (6): age, attitude, deep, stra, surf, points
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dim(learning2014) #Explore the dimensions of the data
## [1] 166 7
str(learning2014) #Explore the structure of the data
## spc_tbl_ [166 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ gender : chr [1:166] "F" "M" "F" "M" ...
## $ age : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
## $ points : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
## - attr(*, "spec")=
## .. cols(
## .. gender = col_character(),
## .. age = col_double(),
## .. attitude = col_double(),
## .. deep = col_double(),
## .. stra = col_double(),
## .. surf = col_double(),
## .. points = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
# graphical overview of the data.
ggpairs(learning2014, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))
# a scatter plot of points versus attitude
qplot(attitude, points, data = learning2014) + geom_smooth(method = "lm")+ggtitle("Relation Between the Attitude and Exam Points") + theme(plot.title = element_text(hjust = 0.5 , size=15, face="bold", color = "blue"))
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## `geom_smooth()` using formula = 'y ~ x'
# a scatter plot of points versus deep learning
qplot(deep, points, data = learning2014) + geom_smooth(method = "lm")+ggtitle("Relation Between the Deep Learning and Exam Points")+ theme(plot.title = element_text(hjust = 0.5 , size=15, face="bold", color = "blue"))
## `geom_smooth()` using formula = 'y ~ x'
# a scatter plot of points versus Strategic learning
qplot(stra, points, data = learning2014) + geom_smooth(method = "lm")+ggtitle("Relation Between the Strategic Learning and Exam Points")+ theme(plot.title = element_text(hjust = 0.5 , size=15, face="bold", color = "blue"))
## `geom_smooth()` using formula = 'y ~ x'
# a scatter plot of points versus surface learning
qplot(surf, points, data = learning2014) + geom_smooth(method = "lm")+ggtitle("Relation Between the Surface Learning and Exam Points")+ theme(plot.title = element_text(hjust = 0.5 , size=15, face="bold", color = "blue"))
## `geom_smooth()` using formula = 'y ~ x'
# summaries of the variables in the data
summary(learning2014)
## gender age attitude deep
## Length:166 Min. :17.00 Min. :1.400 Min. :1.583
## Class :character 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333
## Mode :character Median :22.00 Median :3.200 Median :3.667
## Mean :25.51 Mean :3.143 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083
## Max. :55.00 Max. :5.000 Max. :4.917
## stra surf points
## Min. :1.250 Min. :1.583 Min. : 7.00
## 1st Qu.:2.625 1st Qu.:2.417 1st Qu.:19.00
## Median :3.188 Median :2.833 Median :23.00
## Mean :3.121 Mean :2.787 Mean :22.72
## 3rd Qu.:3.625 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :5.000 Max. :4.333 Max. :33.00
table(learning2014$gender)
##
## F M
## 110 56
reg_Model <- lm(points ~ attitude + deep+ stra , data = learning2014)
# print out a summary of the model
summary(reg_Model)
##
## Call:
## lm(formula = points ~ attitude + deep + stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.5239 -3.4276 0.5474 3.8220 11.5112
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.3915 3.4077 3.343 0.00103 **
## attitude 3.5254 0.5683 6.203 4.44e-09 ***
## deep -0.7492 0.7507 -0.998 0.31974
## stra 0.9621 0.5367 1.793 0.07489 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 162 degrees of freedom
## Multiple R-squared: 0.2097, Adjusted R-squared: 0.195
## F-statistic: 14.33 on 3 and 162 DF, p-value: 2.521e-08
coefficients(reg_Model)
## (Intercept) attitude deep stra
## 11.3914511 3.5253519 -0.7491996 0.9620827
# the Best fitted model
fit_Model <- lm(points ~ attitude + stra , data = learning2014)
# The significant fitted model (attitude only)
fit_Model2 <- lm(points ~ attitude , data = learning2014)
- Intercept: 11.3915, the predicted exam points when all explanatory variables are 0.
- Coefficient of determination (R-squared = 0.2097): a measure of how close the data are to the fitted line. It is the proportion of the variation in exam points explained by the explanatory variables (attitude, deep and stra), so here about 21.0%.
- Adjusted R-squared: 0.195, which is lower than the R-squared because some explanatory variables do not contribute to explaining the exam points.
- The F-test has a very low associated p-value, so there is very strong evidence that NOT all three regression coefficients are zero.
- A t-test is used to test the association of exam points with each explanatory variable (the t-value is the estimated regression coefficient divided by its standard error).
- The associated significance levels will not indicate the importance of the explanatory variables in all cases. The only significant p-value is for attitude, but to get the higher adjusted R-squared we keep both attitude and strategic learning.
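Dropping variables one by one can also be checked formally with a nested-model F-test via `anova()`. A minimal sketch of the mechanics on the built-in `mtcars` data (not the course data):

```r
# Fit a smaller model and a larger model on the same data, then test
# whether the extra predictors significantly reduce the residual error.
small <- lm(mpg ~ wt, data = mtcars)
large <- lm(mpg ~ wt + hp + drat, data = mtcars)
anova(small, large)  # a small Pr(>F) favours keeping the extra terms
```

With learning2014 the same call would compare, for example, the attitude-only model against the three-variable model.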
# 4. Fourth Task: Summary of the Fitted Model
# print out a summary of the Fitted model
summary(fit_Model)
##
## Call:
## lm(formula = points ~ attitude + stra, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
## attitude 3.4658 0.5652 6.132 6.31e-09 ***
## stra 0.9137 0.5345 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
coefficients(fit_Model)
## (Intercept) attitude stra
## 8.9728998 3.4658096 0.9136517
- Intercept: 8.9729, the predicted exam points when all explanatory variables are 0.
- Multiple R-squared (= 0.2048): a measure of how close the data are to the fitted line. It is the proportion of the variation in exam points explained by the explanatory variables (attitude and stra), so here about 20.5% — lower than in the first model, as it has fewer variables.
- Adjusted R-squared: 0.1951, which is lower than the R-squared because some explanatory variables do not contribute to explaining the exam points.
- The fitted relation: exam points = 8.9729 + (3.466 × attitude) + (0.914 × strategic learning)
# Fifth Task: Diagnostic plots
# draw diagnostic plots Residuals vs Fitted values 1, Normal QQ-plot 2 and Residuals vs Leverage 5
par(mfrow = c(2,2))
plot(fit_Model, which=c(1,2,5))
Residuals vs Fitted: This plot is used to check whether the residuals exhibit non-linear patterns. If the red line across the center of the plot is roughly horizontal, we can assume the residuals follow a linear pattern. In our model it is almost horizontal, so the linear regression model is appropriate for this dataset.
Normal Q-Q: This plot is used to check whether the residuals of the regression model are normally distributed. If the points fall roughly along the straight diagonal line, we can assume the residuals are normally distributed. In our model most of the points lie roughly along the diagonal.
Residuals vs Leverage: This plot is used to identify influential observations. Any point falling outside Cook’s distance (the dashed lines) is an influential observation.
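The leverage plot can be backed up numerically: `cooks.distance()` returns one value per observation, and a common heuristic flags points above 4/n. A sketch on the built-in `mtcars` data:

```r
# Compute Cook's distance for each observation of a small linear model
# and flag potentially influential points using the 4/n rule of thumb.
fit <- lm(mpg ~ wt + hp, data = mtcars)
cooksd <- cooks.distance(fit)
which(cooksd > 4 / nrow(mtcars))  # names/indices of the flagged observations
```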
# File 'chapter3.Rmd' is created and linked to index
# access the tidyverse libraries tidyr, dplyr, ggplot2
library(readr); library(dplyr); library(ggplot2); library(tidyr); library(gtsummary)
# Read the CSV file
alc<- read_csv("data/alc.csv")
## Rows: 370 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): school, sex, address, famsize, Pstatus, Mjob, Fjob, reason, guardi...
## dbl (17): age, Medu, Fedu, traveltime, studytime, famrel, freetime, goout, D...
## lgl (1): high_use
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
This data was obtained from two secondary-education schools in Portugal. It was collected using school reports and questionnaires.
# Create a logistic regression model
reg_model <- glm(high_use ~ sex +failures + absences + G3, data = alc, family = "binomial")
# print summary of the model
summary(reg_model)
##
## Call:
## glm(formula = high_use ~ sex + failures + absences + G3, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1561 -0.8429 -0.5872 1.0033 2.1393
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.38733 0.51617 -2.688 0.00719 **
## sexM 1.00870 0.24798 4.068 4.75e-05 ***
## failures 0.50382 0.22018 2.288 0.02213 *
## absences 0.09058 0.02322 3.901 9.56e-05 ***
## G3 -0.04671 0.03948 -1.183 0.23671
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 405.59 on 365 degrees of freedom
## AIC: 415.59
##
## Number of Fisher Scoring iterations: 4
# Odds Ratio(OR) computation from coefficients
OR<-coef(reg_model) %>% exp
# compute confidence intervals (CI)
CI<- confint(reg_model)%>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.2497422 0.08880747 0.6760545
## sexM 2.7420380 1.69666006 4.4943946
## failures 1.6550376 1.08094265 2.5808643
## absences 1.0948038 1.04844539 1.1486025
## G3 0.9543647 0.88291082 1.0311936
The variables selected and my hypotheses:

1. Sex: students’ sex predicts high alcohol consumption (hypothesis: males consume more alcohol than females).
2. Health: students’ health predicts high alcohol consumption (hypothesis: the high-use group has a lower health score than the low-use group).
3. Absences: students’ absences predict high alcohol consumption (hypothesis: the high-use group has more absences than the low-use group).
4. G3: students’ grades predict high alcohol consumption (hypothesis: the high-use group has lower grades than the low-use group).
#Bars of high_use and Sex
ggplot(alc, aes(x = high_use, fill = sex))+ geom_bar() + xlab("Gender")+ ggtitle("Relation between Alcohol Consumption and Gender")+ theme(plot.title = element_text(hjust = 0.5 , size=15, face="bold", color = "blue"))
#Summarize high alcohol consumption by gender
alc %>% select(high_use,sex) %>% tbl_summary(by=c("high_use"),percent = "row")
| Characteristic | FALSE, N = 259¹ | TRUE, N = 111¹ |
|---|---|---|
| sex | | |
| F | 154 (79%) | 41 (21%) |
| M | 105 (60%) | 70 (40%) |

¹ n (%)
# Boxplot of high_use and Health
ggplot(alc, aes(x = high_use, y = health, col = high_use))+ geom_boxplot() + ggtitle("Relation between Alcohol Consumption and health")+ theme(plot.title = element_text(hjust = 0.5 , size=15, face="bold", color = "blue"));
ggplot(alc, aes(x = high_use, y = health, col = sex))+ geom_boxplot() + ggtitle("Relation between Alcohol Consumption and health in both genders")+ theme(plot.title = element_text(hjust = 0.5 , size=15, face="bold", color = "blue"))
#Summarize relation between high alcohol consumption and health in both genders
alc %>% group_by(sex, high_use) %>% summarise(mean(health),sd(health));
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 4 × 4
## # Groups: sex [2]
## sex high_use `mean(health)` `sd(health)`
## <chr> <lgl> <dbl> <dbl>
## 1 F FALSE 3.37 1.40
## 2 F TRUE 3.39 1.55
## 3 M FALSE 3.67 1.41
## 4 M TRUE 3.93 1.28
alc %>% group_by( high_use) %>% summarise(mean(health),sd(health))
## # A tibble: 2 × 3
## high_use `mean(health)` `sd(health)`
## <lgl> <dbl> <dbl>
## 1 FALSE 3.49 1.41
## 2 TRUE 3.73 1.40
The overall mean health score for high alcohol consumption (3.73) is higher than for low consumption (3.49). Within the high-use group, the mean health score is higher in males than in females (females = 3.39, males = 3.93).
In males, the difference between low- and high-consumption students is clear (low = 3.67, high = 3.93).
In females, the difference between low- and high-consumption students is small (low = 3.37, high = 3.39).
These results differ from my hypothesis.
# Boxplot of high_use and Absences
ggplot(alc, aes(x = high_use, y = absences, col = high_use))+ geom_boxplot() + ggtitle("Relation between Alcohol Consumption and Abscences")+ theme(plot.title = element_text(hjust = 0.5 , size=15, face="bold", color = "blue"));
ggplot(alc, aes(x = high_use, y = absences, col = sex))+ geom_boxplot() + ggtitle("Relation between Alcohol Consumption and Abscences")+ theme(plot.title = element_text(hjust = 0.5 , size=15, face="bold", color = "blue"))
#Summarize relation between high alcohol consumption and absences in both genders
alc %>% group_by(sex, high_use) %>% summarise(mean(absences),sd(absences));
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 4 × 4
## # Groups: sex [2]
## sex high_use `mean(absences)` `sd(absences)`
## <chr> <lgl> <dbl> <dbl>
## 1 F FALSE 4.25 5.29
## 2 F TRUE 6.85 9.40
## 3 M FALSE 2.91 2.67
## 4 M TRUE 6.1 5.29
alc %>% group_by( high_use) %>% summarise(mean(absences),sd(absences))
## # A tibble: 2 × 3
## high_use `mean(absences)` `sd(absences)`
## <lgl> <dbl> <dbl>
## 1 FALSE 3.71 4.46
## 2 TRUE 6.38 7.06
# Boxplot of high_use and G3
ggplot(alc, aes(x = high_use, y = G3, col = high_use))+ geom_boxplot() + ylab("Grades")+ ggtitle("Relation between Alcohol Consumption and Grades")+ theme(plot.title = element_text(hjust = 0.5 , size=15, face="bold", color = "blue"));
ggplot(alc, aes(x = high_use, y = G3, col = sex))+ geom_boxplot() + ylab("Grades")+ ggtitle("Relation between Alcohol Consumption and Grades in both Genders")+ theme(plot.title = element_text(hjust = 0.5 , size=15, face="bold", color = "blue"))
#Summarize relation between high alcohol consumption and Grades in both genders
alc %>% group_by(sex, high_use) %>% summarise(mean(G3),sd(G3));alc %>% group_by( high_use) %>% summarise(mean(G3),sd(G3))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 4 × 4
## # Groups: sex [2]
## sex high_use `mean(G3)` `sd(G3)`
## <chr> <lgl> <dbl> <dbl>
## 1 F FALSE 11.4 3.24
## 2 F TRUE 11.8 2.81
## 3 M FALSE 12.3 3.45
## 4 M TRUE 10.3 3.08
## # A tibble: 2 × 3
## high_use `mean(G3)` `sd(G3)`
## <lgl> <dbl> <dbl>
## 1 FALSE 11.8 3.35
## 2 TRUE 10.9 3.05
# Create a logistic regression model
reg_model <- glm(high_use ~ sex +health + absences + G3, data = alc, family = "binomial")
# print summary of the model
summary(reg_model)
##
## Call:
## glm(formula = high_use ~ sex + health + absences + G3, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2867 -0.8513 -0.6097 1.0669 2.1530
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.20679 0.58419 -2.066 0.0389 *
## sexM 1.02053 0.24766 4.121 3.78e-05 ***
## health 0.06783 0.08880 0.764 0.4450
## absences 0.09262 0.02333 3.970 7.19e-05 ***
## G3 -0.07619 0.03680 -2.070 0.0384 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 410.38 on 365 degrees of freedom
## AIC: 420.38
##
## Number of Fisher Scoring iterations: 4
# Odds Ratio(OR) computation from coefficients
OR<-coef(reg_model) %>% exp
# compute confidence intervals (CI)
CI<- confint(reg_model)%>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.2991572 0.09342207 0.9282541
## sexM 2.7746780 1.71811644 4.5452414
## health 1.0701795 0.90040767 1.2764711
## absences 1.0970499 1.05023715 1.1509892
## G3 0.9266362 0.86143669 0.9956498
According to the odds ratios, high alcohol consumption is associated with:

- Male sex (OR = 2.77)
- Better health score (OR = 1.07), but this is not significant
- More absences (OR = 1.10)
- Lower grade (OR = 0.93)
# Explore the predictive power of the model, using the variables that, according to the logistic regression, had a statistical relationship with high/low alcohol consumption
# Predict probability of high_use and add it to alc
probabilities <- predict(reg_model, type = "response")
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
# Provide a 2x2 cross tabulation of predictions versus the actual values and optionally display a graphic visualizing both the actual values and the predictions
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 247 12
## TRUE 83 28
#plot of 'high_use' versus 'probability' in 'alc'
ggplot(alc, aes(x = probability , y = high_use, col=prediction)) +geom_point()
# Compute the total proportion of inaccurately classified individuals (= the training error) and compare the performance of the model with a simple guessing strategy (see line 700 in the exercise set)
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2567568
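The training error is computed on the same data the model was fitted to, so it is optimistic. K-fold cross-validation with `boot::cv.glm()` gives a more honest estimate; a sketch of the same loss-function idea on built-in data (not the course model):

```r
library(boot)  # ships with base R

# Same loss as above: proportion of predictions on the wrong side of 0.5
loss <- function(class, prob) mean(abs(class - prob) > 0.5)

m <- glm(am ~ wt, data = mtcars, family = "binomial")
set.seed(1)                                  # folds are drawn at random
cv <- cv.glm(data = mtcars, glmfit = m, cost = loss, K = 10)
cv$delta[1]  # cross-validated misclassification rate
```

`delta[1]` is the raw cross-validation estimate; with the `alc` data the same pattern would be `cv.glm(alc, reg_model, cost = loss_func, K = 10)`.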
# Create a new R Markdown file and save it as an empty file named ‘chapter4.Rmd’
#Load Mass and corrplot libraries
library (MASS); library (corrplot)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:gtsummary':
##
## select
## The following object is masked from 'package:dplyr':
##
## select
## corrplot 0.92 loaded
library(magrittr)
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
library(ggplot2)
# This is a so-called "R chunk" where you can write R code.
data("Boston")
#Explore the structure and the dimensions of the data
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
These data show the housing values in the suburbs of Boston through 506 observations of 14 variables, listed in the structure output above.
# Calculate the correlation matrix
cor_matrix <- cor(Boston)
cor_matrix %>% round(digits=2)
## crim zn indus chas nox rm age dis rad tax ptratio
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## black lstat medv
## crim -0.39 0.46 -0.39
## zn 0.18 -0.41 0.36
## indus -0.36 0.60 -0.48
## chas 0.05 -0.05 0.18
## nox -0.38 0.59 -0.43
## rm 0.13 -0.61 0.70
## age -0.27 0.60 -0.38
## dis 0.29 -0.50 0.25
## rad -0.44 0.49 -0.38
## tax -0.44 0.54 -0.47
## ptratio -0.18 0.37 -0.51
## black 1.00 -0.37 0.33
## lstat -0.37 1.00 -0.74
## medv 0.33 -0.74 1.00
corrplot(cor_matrix, method="circle")
There is wide variety in the correlations among the variables:

1. High positive correlations, e.g. rad and tax = 0.91, indus and tax = 0.72
2. High negative correlations, e.g. dis and nox = -0.77, dis and age = -0.75, dis and indus = -0.71
3. Moderate positive correlations, e.g. rad and crim = 0.63, indus and rad = 0.60
4. Moderate negative correlations, e.g. rm and lstat = -0.61, age and zn = -0.57
5. Almost no correlation between some variables, e.g. zn and chas = -0.04, chas and crim = -0.06
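The circle plot becomes easier to read if only the upper triangle is drawn and the variables are reordered by hierarchical clustering, so correlated blocks sit together. A sketch (purely a presentation choice; the guard simply skips the plot if corrplot is unavailable):

```r
# Correlation matrix of the Boston data, drawn as an ordered upper triangle
cm <- cor(MASS::Boston)
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cm, method = "circle", type = "upper", order = "hclust")
}
```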
# standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix" "array"
# change the object to data frame
boston_scaled<-as.data.frame(boston_scaled)
boston_scaled$crim <- as.numeric(boston_scaled$crim)
#summary of Crim
summary(boston_scaled$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
# create a quantile vector of crim
bins <- quantile(boston_scaled$crim)
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE)
# look at the table of the new factor crime
table(crime)
## crime
## [-0.419,-0.411] (-0.411,-0.39] (-0.39,0.00739] (0.00739,9.92]
## 127 126 126 127
# create labels for crime
labels_crime <- c("low", "med_low", "med_high", "high")
# Repeat the cut, now labelling the crime-rate categories "low", "med_low", "med_high", "high"
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = labels_crime)
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
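Note that `sample()` draws a different 80% split on every knit, so the LDA results below change from run to run. Fixing the RNG seed first makes the split reproducible; a sketch on a fresh copy of the Boston data:

```r
library(MASS)
data(Boston)

set.seed(123)                          # any fixed seed makes the split repeatable
n   <- nrow(Boston)
ind <- sample(n, size = floor(n * 0.8))
train <- Boston[ind, ]
test  <- Boston[-ind, ]
c(train = nrow(train), test = nrow(test))
```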
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# Fit the linear discriminant analysis on the train set.
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2475248 0.2425743 0.2698020 0.2400990
##
## Group means:
## zn indus chas nox rm age
## low 1.09192475 -0.9912578 -0.11484506 -0.9091316 0.4689974 -0.9337820
## med_low -0.09281428 -0.2633801 -0.07145661 -0.5430794 -0.1114517 -0.3448286
## med_high -0.37552361 0.1754278 0.19723335 0.3516234 0.1412393 0.3772003
## high -0.48724019 1.0149946 -0.02879709 1.0504838 -0.4008802 0.8172071
## dis rad tax ptratio black lstat
## low 0.9115221 -0.6706366 -0.7394641 -0.50510906 0.38358421 -0.79864362
## med_low 0.3405493 -0.5564696 -0.4805604 -0.08880976 0.30948775 -0.13341954
## med_high -0.3550392 -0.4013159 -0.2877350 -0.21888875 0.08909738 -0.06684933
## high -0.8630134 1.6596029 1.5294129 0.80577843 -0.71565108 0.81690022
## medv
## low 0.59249016
## med_low 0.01130012
## med_high 0.20222172
## high -0.65370444
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.12253775 0.8031322109 -0.84083986
## indus 0.03557388 -0.4098911919 0.47762064
## chas -0.09215050 0.0086198852 0.01101208
## nox 0.33657516 -0.7213826926 -1.48004910
## rm -0.10951879 -0.1306524165 -0.13129028
## age 0.28902937 -0.2093621014 -0.17441950
## dis -0.08865629 -0.3491559235 0.17167335
## rad 3.23880051 0.9819799036 0.12944735
## tax -0.07112787 -0.0076741853 0.38309652
## ptratio 0.16400758 0.0008162476 -0.34086547
## black -0.17145243 0.0076732967 0.11909976
## lstat 0.15533992 -0.2430963803 0.36092175
## medv 0.18066476 -0.3568682907 -0.21414092
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9424 0.0446 0.0130
# Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables.
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
#Draw the LDA (bi)plot.
# plot the lda results
plot(lda.fit, dimen = 2); lda.arrows(lda.fit, myscale = 1)
Prior probabilities of the groups (from the printed lda.fit output): low = 24.8%, med_low = 24.3%, med_high = 27.0%, high = 24.0%. The proportion of separation achieved by each linear discriminant is 94.2% for LD1, 4.5% for LD2 and 1.3% for LD3.
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 10 14 3 0
## med_low 3 19 6 0
## med_high 0 3 14 0
## high 0 0 1 29
The cross-tabulation shows that the model predicts the high class almost perfectly (29 of 30 correct) but confuses the low and med_low classes; overall 72 of the 102 test observations (about 71%) are classified correctly.
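A confusion table condenses to a single accuracy figure by comparing predicted and true classes; the same `mean(predicted == correct)` idea applies to the crime classes above. A self-contained sketch on the built-in `iris` data:

```r
library(MASS)

# Train LDA on a random 80% of iris, then score accuracy on the held-out 20%
set.seed(1)
idx  <- sample(nrow(iris), size = 0.8 * nrow(iris))
fit  <- lda(Species ~ ., data = iris[idx, ])
pred <- predict(fit, newdata = iris[-idx, ])
mean(pred$class == iris$Species[-idx])  # proportion classified correctly
```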
#Reload the Boston dataset and standardize the dataset
# load the data
data("Boston")
#scale dataset
boston_scaled <- scale(Boston, center = TRUE, scale = TRUE)
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
#calculate distance between observations
dist_eu <- dist(boston_scaled, method = "euclidean")
#summary of distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_scaled, method = "manhattan")
# look at the summary of the distances
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
#Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again
# k-means clustering
km <- kmeans(Boston, centers = 4)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
# k-means clustering
km <- kmeans(Boston, centers = 2)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)
The optimal number of clusters is where the total WCSS drops radically. In this case, two clusters would seem optimal.
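k-means starts from random centers, so both the WCSS curve and the final clustering vary between runs. Fixing the seed and using several random starts makes the elbow reproducible; a sketch:

```r
library(MASS)
bs <- as.data.frame(scale(Boston))

set.seed(42)
twcss <- sapply(1:10, function(k) {
  kmeans(bs, centers = k, nstart = 10)$tot.withinss  # best of 10 random starts
})
k_best <- which.min(diff(twcss)) + 1  # k at which the largest single WCSS drop ends
k_best
```

This "largest drop" heuristic is only one way to read the elbow; eyeballing the plotted curve, as above, is equally common.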
# k-means clustering =5
km_2 <- kmeans(boston_scaled, centers = 5)
# plot the Boston dataset with clusters
pairs(boston_scaled, col = km_2$cluster)
# linear discriminant analysis
lda.fit_2 <- lda(km_2$cluster ~ ., data = boston_scaled)
# print the LDA.fit_2
lda.fit_2
## Call:
## lda(km_2$cluster ~ ., data = boston_scaled)
##
## Prior probabilities of groups:
## 1 2 3 4 5
## 0.05731225 0.13043478 0.26482213 0.24901186 0.29841897
##
## Group means:
## crim zn indus chas nox rm
## 1 3.0022987 -0.48724019 1.01499462 -0.27232907 1.0593345 -1.3064650
## 2 -0.4129151 2.21369458 -1.14073862 -0.21267603 -1.1776883 0.6268583
## 3 -0.3927690 -0.05350915 -0.72731163 0.05086573 -0.5787393 0.6963941
## 4 0.3838378 -0.48724019 1.10691918 0.10263286 1.1951864 -0.2747091
## 5 -0.3678595 -0.41994296 0.02544269 0.01447956 -0.1724217 -0.4118452
## age dis rad tax ptratio black lstat
## 1 0.9805356 -1.0484716 1.6596029 1.5294129 0.8057784 -1.1906614 1.8708759
## 2 -1.4768042 1.6629608 -0.6477717 -0.5460224 -0.7465036 0.3461285 -0.9418899
## 3 -0.4958014 0.2904872 -0.5584811 -0.7610166 -0.5668389 0.3659840 -0.7643523
## 4 0.7423860 -0.8065257 1.2549050 1.3202837 0.4886767 -0.5898468 0.6409434
## 5 0.2776840 -0.1102834 -0.5871332 -0.4814240 0.2667868 0.2447916 0.1958522
## medv
## 1 -1.3102002
## 2 0.7058131
## 3 0.7824526
## 4 -0.5359120
## 5 -0.3040503
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3 LD4
## crim -0.427666835 -0.199511558 -1.53952452 -0.4079654643
## zn 0.145299886 -1.666731999 -0.13266159 0.8941701291
## indus -0.344632503 0.007043196 0.31049852 0.4203611322
## chas -0.078704824 0.058245348 0.03305941 0.0762455497
## nox -1.122343755 -0.742824398 0.62047881 -0.1461192814
## rm 0.460578250 -0.082860834 0.46972900 -0.3594922840
## age -0.004009447 0.638346347 -0.09833009 0.5748989064
## dis -0.007017325 -0.592461290 0.20900264 -0.0006883399
## rad -1.203224091 0.019174331 0.63799168 -0.9675794050
## tax -0.827069577 -1.161729561 0.30529810 0.3353654284
## ptratio -0.120673757 0.077914970 -0.07831677 0.5866415075
## black 0.121519360 0.054015976 0.06837498 -0.0012996769
## lstat -0.434053312 -0.108659705 -0.55646297 0.0223022252
## medv -0.145642897 -0.282461279 -0.38632976 -0.3660128703
##
## Proportion of trace:
## LD1 LD2 LD3 LD4
## 0.7362 0.1819 0.0571 0.0248
# plot the lda results
plot(lda.fit_2, dimen = 2, pch = km_2$cluster)
lda.arrows(lda.fit_2, myscale = 2)
# Install Packages
library(GGally)
# Read the CSV file
human<- read.csv("data/human.csv",row.names=1)
#summary of variables
summary(human)
## Edu2.FM Labo.FM Life.Exp Edu.Exp
## Min. :0.1717 Min. :0.1857 Min. :49.00 Min. : 5.40
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:66.30 1st Qu.:11.25
## Median :0.9375 Median :0.7535 Median :74.20 Median :13.50
## Mean :0.8529 Mean :0.7074 Mean :71.65 Mean :13.18
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:77.25 3rd Qu.:15.20
## Max. :1.4967 Max. :1.0380 Max. :83.50 Max. :20.20
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
#Graphical overview
ggpairs(human, progress = FALSE )
All variables are skewed, either positively or negatively, except expected years of education (Edu.Exp), which is roughly symmetric.
The correlations between the variables vary:
Strong positive correlations: Life.Exp & Edu.Exp, Life.Exp & GNI, Edu.Exp & GNI, and Ado.Birth & Mat.Mor.
Strong negative correlations: Life.Exp & Mat.Mor, Life.Exp & Ado.Birth, and Mat.Mor & Edu2.FM.
Moderate positive correlations: Edu.Exp & Edu2.FM, Life.Exp & Edu2.FM, and GNI & Edu2.FM.
Moderate negative correlations: Ado.Birth & Edu2.FM, GNI & Mat.Mor, and GNI & Ado.Birth.
Parli.F is the least correlated with all the other variables.
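Instead of reading the correlations off the ggpairs grid, they can be ranked directly with cor(). A sketch on the built-in swiss data, which stands in for the human data frame here; substituting human reproduces the pairs listed above:

```r
# Rank all pairwise correlations by absolute value
cm <- cor(swiss)                           # substitute cor(human)
cm[upper.tri(cm, diag = TRUE)] <- NA       # keep each pair only once
cors <- subset(as.data.frame(as.table(cm)), !is.na(Freq))
head(cors[order(-abs(cors$Freq)), ], 3)    # three strongest pairs
```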
# Perform principal component analysis (PCA) on the raw (non-standardized) human data. Show the variability captured by the principal components. Draw a biplot displaying the observations by the first two principal components (PC1 coordinate in x-axis, PC2 coordinate in y-axis), along with arrows representing the original variables. (0-2 points)
# perform principal component analysis (with the SVD method)
pca_non<-prcomp(human)
#summary of PCA
summary(pca_non)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
# draw a biplot of the principal component representation and the original variables
biplot(pca_non, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = NA, ylab = NA)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
# Standardize the variables in the human data and repeat the above analysis. Interpret the results of both analysis (with and without standardizing). Are the results different? Why or why not? Include captions (brief descriptions) in your plots where you describe the results by using not just your variable names, but the actual phenomena they relate to. (0-4 points)
# standardize the variables
human_std <- scale(human)
# print out summaries of the standardized variables
summary(human_std)
## Edu2.FM Labo.FM Life.Exp Edu.Exp
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7188 Min. :-2.7378
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6425 1st Qu.:-0.6782
## Median : 0.3503 Median : 0.2316 Median : 0.3056 Median : 0.1140
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.6717 3rd Qu.: 0.7126
## Max. : 2.6646 Max. : 1.6632 Max. : 1.4218 Max. : 2.4730
## GNI Mat.Mor Ado.Birth Parli.F
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_std)
#summary of PCA
summary(pca_human)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
## PC8
## Standard deviation 0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion 1.00000
summary(pca_non)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = NA, ylab = NA)
If the variables are measured in different units, it is recommended to standardize the data before PCA; otherwise the variables with the largest numeric ranges dominate the principal components and the remaining structure is hard to see.
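The same standardization can also be requested inside prcomp() itself via the scale. argument, which avoids the separate scale() step. A quick check on the built-in swiss data (substitute human):

```r
# Pre-scaling and prcomp's own scaling give identical results
p1 <- prcomp(scale(swiss))
p2 <- prcomp(swiss, center = TRUE, scale. = TRUE)
all.equal(p1$sdev, p2$sdev)  # TRUE: identical component standard deviations
```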
# Give your personal interpretations of the first two principal component dimensions based on the biplot drawn after PCA on the standardized human data. (0-2 points)
Standardizing the data produces a much clearer visualization of the variation in the PCA. The proportions of variance and the cumulative proportions can now be interpreted easily, and the arrows are visible, showing the relations between the variables.
# Load library
library(dplyr)
library(tidyr)
# load data
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
# select the 'keep_columns' to create a new dataset
tea_time <- dplyr::select(tea, all_of(keep_columns))
# look at the summaries and structure of the data
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
# visualize the dataset
pivot_longer(tea_time, cols = everything()) %>%
ggplot(aes(value)) + facet_wrap("name", scales = "free")+geom_bar()+theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
# Use Multiple Correspondence Analysis (MCA) on the tea data (or on just certain columns of the data, it is up to you!). Interpret the results of the MCA and draw at least the variable biplot of the analysis. You can also explore other plotting options for MCA. Comment on the output of the plots. (0-4 points)
# multiple correspondence analysis
library(FactoMineR)
mca <- MCA(tea_time[, 1:5], graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_time[, 1:5], graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.335 0.309 0.257 0.224 0.198 0.184 0.172
## % of var. 16.762 15.441 12.866 11.181 9.924 9.188 8.625
## Cumulative % of var. 16.762 32.204 45.069 56.250 66.174 75.362 83.987
## Dim.8 Dim.9 Dim.10
## Variance 0.141 0.105 0.075
## % of var. 7.031 5.249 3.734
## Cumulative % of var. 91.017 96.266 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.325 0.105 0.088 | -0.288 0.090 0.069 | 0.305
## 2 | -0.259 0.067 0.037 | -0.074 0.006 0.003 | 0.789
## 3 | -0.403 0.162 0.242 | -0.283 0.086 0.119 | 0.217
## 4 | -0.580 0.334 0.481 | -0.328 0.116 0.154 | -0.290
## 5 | -0.403 0.162 0.242 | -0.283 0.086 0.119 | 0.217
## 6 | -0.403 0.162 0.242 | -0.283 0.086 0.119 | 0.217
## 7 | -0.403 0.162 0.242 | -0.283 0.086 0.119 | 0.217
## 8 | -0.259 0.067 0.037 | -0.074 0.006 0.003 | 0.789
## 9 | 0.155 0.024 0.012 | 0.974 1.025 0.461 | -0.005
## 10 | 0.521 0.270 0.142 | 0.845 0.770 0.373 | 0.612
## ctr cos2
## 1 0.120 0.078 |
## 2 0.806 0.343 |
## 3 0.061 0.070 |
## 4 0.109 0.120 |
## 5 0.061 0.070 |
## 6 0.061 0.070 |
## 7 0.061 0.070 |
## 8 0.806 0.343 |
## 9 0.000 0.000 |
## 10 0.485 0.196 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 0.473 3.294 0.073 4.681 | 0.198 0.627 0.013
## Earl Grey | -0.265 2.689 0.126 -6.147 | 0.088 0.319 0.014
## green | 0.487 1.558 0.029 2.962 | -0.956 6.515 0.113
## alone | -0.017 0.011 0.001 -0.405 | -0.251 2.643 0.117
## lemon | 0.668 2.926 0.055 4.059 | 0.458 1.497 0.026
## milk | -0.338 1.428 0.030 -3.010 | 0.220 0.659 0.013
## other | 0.287 0.148 0.003 0.873 | 2.207 9.465 0.151
## tea bag | -0.607 12.466 0.482 -12.008 | -0.336 4.134 0.147
## tea bag+unpackaged | 0.348 2.261 0.055 4.063 | 1.000 20.281 0.456
## unpackaged | 1.959 27.484 0.524 12.511 | -1.025 8.173 0.143
## v.test Dim.3 ctr cos2 v.test
## black 1.961 | 1.045 20.945 0.358 10.342 |
## Earl Grey 2.033 | -0.462 10.689 0.386 -10.737 |
## green -5.813 | 0.360 1.110 0.016 2.190 |
## alone -5.905 | 0.150 1.136 0.042 3.534 |
## lemon 2.787 | -1.561 20.842 0.301 -9.491 |
## milk 1.963 | 0.093 0.141 0.002 0.828 |
## other 6.712 | 1.826 7.773 0.103 5.552 |
## tea bag -6.637 | 0.069 0.211 0.006 1.367 |
## tea bag+unpackaged 11.677 | -0.050 0.061 0.001 -0.586 |
## unpackaged -6.548 | -0.196 0.357 0.005 -1.249 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.115 0.421 |
## How | 0.076 0.220 0.385 |
## how | 0.708 0.503 0.008 |
## sugar | 0.065 0.004 0.412 |
## where | 0.701 0.702 0.061 |
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali", graph.type = "classic")
The plot above helps to identify the variable categories that are most associated with each dimension:
People who drink green tea tend to buy it unpackaged from a tea shop.
People who drink Earl Grey tend to buy it as tea bags from a chain store, and often take it with milk and sugar.
People who drink black tea tend to take it with no sugar or with lemon.
The mixed categories (chain store + tea shop, tea bag + unpackaged) are less clearly identified with any single dimension.
# Load packages
library(readr)
library(dplyr)
library(ggplot2)
RATSL <- read.csv("data/RATSL.csv")
glimpse(RATSL)
## Rows: 176
## Columns: 5
## $ ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3,…
## $ Group <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1, …
## $ WD <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", …
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470…
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, …
# Convert categorical variables in RATS to factors.
RATSL$ID <- factor(RATSL$ID)
RATSL$Group <- factor(RATSL$Group)
# Glimpse the data
glimpse(RATSL)
## Rows: 176
## Columns: 5
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2, 3,…
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 1, …
## $ WD <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", …
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, 470…
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, …
# Draw RATSL plot
ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:8, times=2)) +
facet_grid(. ~ Group, labeller = label_both) +
scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))
# standardise RATSL variable
RATSL <- RATSL %>%
group_by(Time) %>%
mutate(stdWeight = (Weight - mean(Weight))/sd(Weight)) %>%
ungroup()
# Glimpse the data
glimpse(RATSL)
## Rows: 176
## Columns: 6
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1, 2,…
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, …
## $ WD <chr> "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1", "WD1…
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, 555, …
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, …
## $ stdWeight <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0.8819001, …
# Plot standardized RATSL
ggplot(RATSL, aes(x = Time, y = stdWeight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:8, times=2)) +
facet_grid(. ~ Group, labeller = label_both) +
scale_y_continuous(name = "standardized RATSL")
Interpretation: Standardization removes the overall level differences in weight, so the growth profiles can be compared on a common scale throughout the follow-up. Groups 2 and 3 gain weight more than Group 1, and Group 2 contains a clear outlier.
# the number of rats differs per group (8, 4 and 4), so count it with n()
RATSS <- RATSL %>%
group_by(Group, Time) %>%
summarise(mean = mean(Weight), se = sd(Weight)/sqrt(n())) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
# plot the mean profiles
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=3) +
scale_shape_manual(values = c(1,2,3)) +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
scale_y_continuous(name = "mean(Weight) +/- se(Weight)")
Interpretation: The graph shows the mean profiles with their standard errors, which are highest in Group 2 and lowest in Group 1. The standard error in Group 3 decreases over time.
# Create summary data with the mean as the summary variable (ignoring baseline, day 1)
RATSLX <- RATSL %>%
filter(Time > 1) %>%
group_by(Group, ID) %>%
summarise(mean=mean(Weight)) %>%
ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
# summary of the RATSLX
summary(RATSLX)
## Group ID mean
## 1:8 1 : 1 Min. :238.9
## 2:4 2 : 1 1st Qu.:267.4
## 3:4 3 : 1 Median :360.1
## 4 : 1 Mean :386.3
## 5 : 1 3rd Qu.:505.4
## 6 : 1 Max. :594.0
## (Other):10
# box plot for RATSLX
ggplot(RATSLX, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "(Weight)/ days ")
# create a new dataset by removing the outlier
RATSLX2 <- RATSLX %>% filter(mean < 550)
# draw a boxplot of without the outliers
ggplot(RATSLX2, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "(Weight)/ days ")
# Fit the linear model with the mean as the response
fit <- lm(mean ~ Group, data = RATSLX2)
anova(fit)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 2 207659 103830 501.81 2.721e-12 ***
## Residuals 12 2483 207
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation: The ANOVA test shows a highly significant difference in mean weight between the three study groups (p < 0.001).
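ANOVA only indicates that the group means differ somewhere; a pairwise follow-up such as Tukey's HSD would locate the differences. A sketch on synthetic data shaped like RATSLX2 (a Group factor and a mean response); this step is not part of the original analysis:

```r
# Pairwise group comparisons after a significant ANOVA (synthetic data)
set.seed(42)
d <- data.frame(Group = factor(rep(1:3, each = 5)),
                mean  = c(rnorm(5, 260, 10), rnorm(5, 450, 10), rnorm(5, 530, 10)))
tk <- TukeyHSD(aov(mean ~ Group, data = d))
tk$Group  # difference, confidence interval and adjusted p-value per pair
```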
# Read BPRSL dataset
BPRSL <- read.csv("data/BPRSL.csv")
# Factor treatment & subject
BPRSL$treatment <- factor(BPRSL$treatment)
BPRSL$subject <- factor(BPRSL$subject)
str(BPRSL)
## 'data.frame': 360 obs. of 5 variables:
## $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ weeks : chr "week0" "week0" "week0" "week0" ...
## $ bprs : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week : int 0 0 0 0 0 0 0 0 0 0 ...
# Standardise the variable bprs
BPRSL <- BPRSL %>%
group_by(week) %>%
mutate(stdbprs = (bprs-mean(bprs))/sd(bprs)) %>%
ungroup()
# Glimpse the data
glimpse(BPRSL)
## Rows: 360
## Columns: 6
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ weeks <chr> "week0", "week0", "week0", "week0", "week0", "week0", "week0…
## $ bprs <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, 38, 66, …
## $ week <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ stdbprs <dbl> -0.4245908, 0.7076513, 0.4245908, 0.4953559, 1.6983632, 0.00…
# Create a summary data by treatment and subject with mean as the summary variable (ignoring baseline week 0)
BPRSL8S <- BPRSL %>%
filter(week > 0) %>%
group_by(treatment, subject) %>%
summarise( mean=mean(bprs) ) %>%
ungroup()
## `summarise()` has grouped output by 'treatment'. You can override using the
## `.groups` argument.
# Glimpse the data
glimpse(BPRSL8S)
## Rows: 40
## Columns: 3
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
## $ mean <dbl> 41.500, 43.125, 35.375, 52.625, 50.375, 34.000, 37.125, 30.5…
# Draw a boxplot of the mean versus treatment
library(ggplot2)
ggplot(BPRSL8S, aes(x = treatment, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(bprs), weeks 1-8")
# create a regression model for BPRS
BPRS_reg <- lm(bprs ~ week + treatment, data = BPRSL)
# access library lme4
library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
# Create a random intercept model
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
# summary of the intercept model
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.9090 24.334
## week -2.2704 0.2084 -10.896
## treatment2 0.5722 1.0761 0.532
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.437
## treatment2 -0.282 0.000
Create a random intercept and random slope model:
# create a random intercept and random slope model
BPRS_ref2 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)
#summary of model
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7976
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8222 8.0512
## week 0.9609 0.9802 -0.51
## Residual 97.4305 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.1052 22.066
## week -2.2704 0.2977 -7.626
## treatment2 0.5722 1.0405 0.550
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.582
## treatment2 -0.247 0.000
# perform an ANOVA test on the two models
anova(BPRS_ref2, BPRS_ref)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref2: bprs ~ week + treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref 5 2748.7 2768.1 -1369.4 2738.7
## BPRS_ref2 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation: The ANOVA test shows a statistically significant improvement (p = 0.026), so the random intercept and random slope model fits the data better than the random intercept model.
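The comparison above is a likelihood-ratio test, so the chi-square statistic and p-value can be reproduced by hand from the two deviances printed in the model summaries (2738.7 and 2731.4, with 2 extra parameters):

```r
# Likelihood-ratio test from the reported deviances
chisq <- 2738.7 - 2731.4                    # 7.3, close to the 7.2721 in the table
pchisq(chisq, df = 2, lower.tail = FALSE)   # about 0.026, matching the ANOVA output
```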
# create a random intercept and random slope model with the interaction
BPRS_ref3 <- lmer(bprs ~ week + treatment + week * treatment + (week | subject), data = BPRSL, REML = FALSE)
# print a summary of the model
summary(BPRS_ref3)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + week * treatment + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2744.3 2775.4 -1364.1 2728.3 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0512 -0.6271 -0.0768 0.5288 3.9260
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.9964 8.0620
## week 0.9687 0.9842 -0.51
## Residual 96.4707 9.8220
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.2521 21.262
## week -2.6283 0.3589 -7.323
## treatment2 -2.2911 1.9090 -1.200
## week:treatment2 0.7158 0.4010 1.785
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.650
## treatment2 -0.424 0.469
## wek:trtmnt2 0.356 -0.559 -0.840
# perform an ANOVA test on the two models
anova(BPRS_ref3, BPRS_ref2)
## Data: BPRSL
## Models:
## BPRS_ref2: bprs ~ week + treatment + (week | subject)
## BPRS_ref3: bprs ~ week + treatment + week * treatment + (week | subject)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref2 7 2745.4 2772.6 -1365.7 2731.4
## BPRS_ref3 8 2744.3 2775.4 -1364.1 2728.3 3.1712 1 0.07495 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The difference is not statistically significant at p < 0.05 (p = 0.075), so the interaction model does not fit the data significantly better than the random slope model.
# draw the plot of BPRSL with the observed Weight values
ggplot(BPRSL, aes(x = week, y = bprs, linetype = subject)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)), name = "Observed BPRS")
# Create a vector of the fitted values
Fitted <- fitted(BPRS_ref3)
# Create a new column fitted to BPRSL
BPRSL <- BPRSL %>% mutate(Fitted = Fitted)
# draw the plot of BPRSL with the Fitted values of weight
ggplot(BPRSL, aes(x = week, y = Fitted, linetype = subject)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(BPRSL$bprs), max(BPRSL$bprs)), name = "Fitted BPRS")
The fitted values show much less individual variation in BPRS than the observed profiles: the model smooths each subject's trajectory toward a straight line.
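That smoothing is expected: for a linear model with an intercept, the fitted values can never vary more than the observed response, because the response variance splits into fitted plus residual variance. A toy check with lm():

```r
# Fitted values always have smaller (or equal) variance than the response
set.seed(7)
toy <- data.frame(x = 1:30, y = 1:30 + rnorm(30, sd = 5))
fit <- lm(y ~ x, data = toy)
var(fitted(fit)) <= var(toy$y)  # TRUE
```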